Optimal phylogenetic reconstruction of insertion and deletion events

Abstract Motivation Insertions and deletions (indels) influence the genetic code in fundamentally distinct ways from substitutions, significantly impacting gene product structure and function. Despite their influence, the evolutionary history of indels is often neglected in phylogenetic tree inference and ancestral sequence reconstruction, hindering efforts to comprehend biological diversity determinants and engineer variants for medical and industrial applications. Results We frame determining the optimal history of indel events as a single Mixed-Integer Programming (MIP) problem, across all branch points in a phylogenetic tree adhering to topological constraints, and all sites implied by a given set of aligned, extant sequences. By disentangling the impact on ancestral sequences at each branch point, this approach identifies the minimal indel events that jointly explain the diversity in sequences mapped to the tips of that tree. MIP can recover alternate optimal indel histories, if available. We evaluated MIP for indel inference on a dataset comprising 15 real phylogenetic trees associated with protein families ranging from 165 to 2000 extant sequences, and on 60 synthetic trees at comparable scales of data and reflecting realistic rates of mutation. Across relevant metrics, MIP outperformed alternative parsimony-based approaches and reported the fewest indel events, on par or below their occurrence in synthetic datasets. MIP offers a rational justification for indel patterns in extant sequences; importantly, it uniquely identifies global optima on complex protein data sets without making unrealistic assumptions of independence or evolutionary underpinnings, promising a deeper understanding of molecular evolution and aiding novel protein design. Availability and implementation The implementation is available via GitHub at https://github.com/santule/indelmip.


Supplementary Material
The details of the real datasets used in this paper are shown in Table S1.

TrAVIS synthetic data generation
This section outlines the details around how synthetic data is generated using TrAVIS, which forms part of the GRASP-suite Foley et al. (2022).TrAVIS samples the Gamma distribution γ(κ, θ) (where κ ∈ {0.5, 1, 2} defines the shape and θ = 0.2 the scale) to set distances on each branch (in turn normalised to the mean δ), bi-furcating each branch point until the specified number of sequences have been mapped as leaves.
For each tree, substitutions, insertions and deletions were randomly introduced at each branch point.Starting at the root with an arbitrary amino acid sequence with a length 80 or longer, the sequence at each of its children is determined recursively.For a sequence, each position is considered as a possible site for a mutation as a Poisson process with a relative rate 2r (r = 1 for all our synthetic data); following the recipe given by Cartwright Cartwright ( 2008), η = e −2rt is the probability that no indel has occurred, which implies a substitution (determined via a model like that suggested by Le and Gasquel Le and Gascuel ( 2008), for instance); the probability of an insertion is (1 − η)/2 which is equivalent to that of a deletion.Finally, the length of the insertion (or deletion) is given by a sampling the Poisson distribution f (λ) (where mean λ = 1, which will give 0 or 1 37% of the time, with greater widths less frequent).Depending on the length of the sequence provided at the root, the resulting aligned sequences ranged from 143 to 2606 positions across the 60 trees.

Visualisation of indel events
We were curious to understand how methods differ qualitatively, including how indel events distribute across trees.We developed a visualisation where the root node is designated as having zero indels.Following a post-order traversal of the tree structure, we enumerate subsequent events in relation to the root node and cumulatively show their number on the y-axis.The x-axis signifies the cumulative level of the internal node relative to the root at 0. We show the predicted indel events for RNaseZ 243 in Figure S1 illustrating no particular bias and the overall trend that MIP includes least events.

Results on synthetic datasets
Indel scores for synthetic datasets

Evolutionary cohesiveness in synthetic datasets
Figure S3 shows percentage of branch points where at least one indel state differs from all its descendants and ancestor in 48 synthetic datasets.

Indel footprints in synthetic datasets
Figure S4 shows percentage of non-confirming ancestors in 48 synthetic datasets.

Dataset Detail B3 225
The sequences belong to subclass B3 Metallo-β-lactamases. We use Glyoxalase II as an outgroup.
RNaseZ 243, RNaseZ 624 RNaseZ 243 and RNaseZ 624 are Ribonuclease Z sequences across archaea, bacteria, and eukaryotes either limiting the eukaryotic sequences to expected ortholog of the human elaC 1 protein (RNaseZ 243) or also including ortholog of the longer human elaC 2 (RNaseZ 624).We use PhnP sequences, a phosphodiesterase, as an outgroup.

KARI 716
The sequences belong to Class I ketol-acid reductoisomerases.
We place the tree root on the branch between the archaeal and bacterial sequences.

ALPHA 1263
The sequences belong to 38 species and 1263 non-redundant sequences from the Alphavirus genus (Family Togaviridae).The tree is rooted at the midpoint.

CYP2CE 1656
The sequences belong to the Cytochrome P450 subfamilies CYP2C and CYP2E.
We use vertebrate CYP2 family as an outgroup.
We use 6-Phosphogluconate dehydrogenase sequences as an outgroup.
ALS 1990 ALS 1990 are acetolactate synthases and we use a small selection of anabolic acetohydroxy-acid synthases as an outgroup.

Hyperparameter experiments
We conducted experiments to gauge the impact of hyperparameter α on indel scores for the synthetic dataset t300d0.5s0.5 (see Figure S6).The indel score peaks when α is zero since there is no penalty for gap opening.As α increases, the indel score decreases, and then levels off.Comparison with Snir and Pachter's method.
Here we present the results of experiments from running indel inference on very small synthetic phylogenetic trees generated by tool TrAVIS using δ of 1.0 and κ of 1.1.We generate datasets by varying the number of extants and sequence positions, assigning them corresponding names.For example, the label t10p100 denotes a tree with 10 leaves and sequences containing 100 positions.The MIP Indel inference method is compared against the dynamic programming method proposed by Snir and Pachter (2011).We implemented the SP algorithm and made it available at https://github.com/Tingzhaooo/spindel.

Fig
Fig.S1.Cumulative indel events predicted (y-axis) for the RNaseZ 243 dataset disperse evenly across different levels of the tree (x-axis) for all methods.MIP tends to predict fewer indels (brown lines).

Figure
FigureS5shows indel scores for six synthetic datasets.

Fig. S2 .
Fig. S2.Indel scores for 48 synthetic datasets.Remaining 12 not shown as they report similar trends.

Fig. S3 .
Fig. S3.Percentage of incohesive ancestors for 48 synthetic datasets.Remaining 12 not shown as they report similar trends.

Fig. S4 .
Fig. S4.Percentage of non-confirming ancestors for 48 synthetic datasets.Remaining 12 not shown as they report similar trends.

Table S1 .
Real protein families datasets used for MIP Indel evaluation.

Table S2 .
Information on the remaining real datasets used in this paper (not referenced from Foley et al. (2022)).These datasets are newly created for different protein families.